Abstract

Web-based statistical instruction, like all statistical 
instruction, ought to focus on teaching the essence of the research 
endeavor: the exercise of reflective judgment. Using the framework 
of the recent report of the APA Task Force on Statistical Inference 
(Wilkinson & The APA Task Force on Statistical Inference, 1999),
the present paper explores background for and potential 
instructional design of Web-based instruction involving (a) effect 
size reporting and interpretation and (b) score reliability 
evaluation. 

In 1993, Carl Kaestle, prior to his term as President of the
National Academy of Education, published in the Educational 
Researcher an article titled, "The Awful Reputation of Education 
Research." It is noteworthy that the article took as a given the 
conclusion that educational research suffers an awful reputation, 
and rather than justifying this conclusion, Kaestle focused instead 
on exploring the etiology of this presumed reality. For example, 
Kaestle (1993) noted that the education R&D community is seemingly 
in perpetual disarray, and that there is a 

...lack of consensus — lack of consensus on goals, 
lack of consensus on research results, and lack of a 
united front on funding priorities and 
procedures. . . . [T]he lack of consensus on goals is 
more than political; it is the result of a weak 
field that cannot make tough decisions to do some 
things and not others, so it does a little of 
everything. . . (p. 29) 

Although Kaestle (1993) did not find it necessary to provide a 
warrant for his conclusion that educational research has an awful 
reputation, others have directly addressed this concern. 

The National Academy of Sciences evaluated educational research
generically, and found "methodologically weak research, trivial 
studies, an infatuation with jargon, and a tendency toward fads 
with a consequent fragmentation of effort" (Atkinson & Jackson, 
1992, p. 20). Others also have argued that "too much of what we see 
in print is seriously flawed" as regards research methods, and that 
"much of the work in print ought not to be there" (Tuckman, 1990,
p. 22). Gall, Borg and Gall (1996) concurred, noting that "the 
quality of published studies in education and related disciplines 
is, unfortunately, not high" (p. 151). 

Indeed, empirical studies of published research involving 
methodology experts as judges corroborate these impressions. For 
example, Hall, Ward and Comer (1988) and Ward, Hall and Schramm 
(1975) found that over 40% and over 60%, respectively, of published 
research was seriously or completely flawed. Wandt (1967) and 
Vockell and Asher (1974) reported similar results from their 
empirical studies of the quality of published research. 
Dissertations, too, have been examined, and have been found 
methodologically wanting (cf. Thompson, 1988, 1994). 

Purpose of the Present Paper 

These troubling realizations have led to some self-scrutiny on 
the part of professors of educational research as regards the 
training we provide to our students. Certainly, in an environment
where less and less curricular space is allocated to
methodological teaching (cf. Aiken, West, Sechrest, Reno with
Roediger, Scarr, Kazdin, & Sherman, 1990), not all these problems
can be laid at the doors of methodology professors.

Still, there is clearly some room for improvement in what we 
do. The present paper offers one perspective on potential vehicles 
for improvement. 

Today increasing numbers of faculty are utilizing Web-based 
instructional tools to facilitate research training. Some 
applications allow students, for example, to "drag" data points in
histograms or scattergrams, and watch the associated incremental
changes in statistical indices. Applications such as these provide 
a user-friendly environment in which students can readily ask 
"what-if" questions and explore statistical dynamics. 

One important skill that students must master is recognizing 
the various rival hypotheses that may explain the results in the 
literature they review, or in their own research. One way to teach 
such skills is to present synopses or excerpts from actual studies 
in a Web environment, and then allow students to enter "chat rooms" 
to offer alternative explanations for detected effects. Given the 
frailties of the human reviewer system that guards the gates of the 
publication citadel (Peters & Ceci, 1982), students must learn 
early to evaluate critically all that they read, or they will
inevitably come to rely on specious published claims.

One potential source of study vignettes is the set of popular books
offered by Huck and his colleagues (cf. 2000; Huck & Cormier,
1996). Particularly relevant to the current focus are the vignettes
presented by Huck and Sandler (1979).

Huck and Sandler (1979) presented a series of short study 
synopses in which various rival hypotheses might be invoked to 
explain reported results. The reader is then challenged to 
formulate these possibilities, and the back portion of the book 
presents possible alternative study explanations. 

The purpose of the vignettes was characterized as facilitating
"logical thinking" (p. xvi), and assisting "people in
discriminating possible rival hypotheses and plausible rival
hypotheses" (p. xiv). In other words, the purpose of the problems
and their proposed solutions is to teach students to think and 
evaluate critically the claims in published (or unpublished) 
research! 

The present treatise takes as a given both the utility and the 
import of just such an instructional emphasis. My own teaching is 
similarly focused. However, my pedagogic bias is frankly toward 
Socratic instruction with an emphasis on heuristic techniques 
requiring discovery learning on the part of students, rather than 
toward Web-based instruction, except as a fairly peripheral (but 
powerful) instructional aid. 

Vignettes such as the Huck-Sandler examples might be used in 
Web-based instruction as a tool to help students think 
reflectively. However, my purpose here is to argue that any such 
instruction should be grounded in the contemporary analytic 
principles embedded within the recent report of the APA Task Force 
on Statistical Inference (Wilkinson & The APA Task Force on 
Statistical Inference, 1999).

Here I will advocate emphasis on two of these principles. 
Along the way, in each arena I will also cite some related 
illustrative features of Web-based instruction that I would find 
useful.

Principle #1: Report and Interpret Effect Sizes
Background 

Statistical significance has a long history (cf. Huberty, 
1993; Huberty & Pike, 1999). Recently, overreliance on statistical
tests has been bluntly criticized (cf. Cohen, 1994; Daniel, 1998;
Schmidt, 1996; Thompson, 1996, 1999c). For example, Tryon (1998)
recently lamented, 

[T]he fact that statistical experts and 
investigators publishing in the best journals cannot 
consistently interpret the results of these analyses 
is extremely disturbing. Seventy-two years of 
education have resulted in minuscule, if any, 
progress toward correcting this situation. It is 
difficult to estimate the handicap that widespread, 
incorrect, and intractable use of a primary data 
analytic method has on a scientific discipline, but 
the deleterious effects are doubtless substantial...
(p. 796)

Indeed, several empirical studies have shown that many researchers 
do not fully understand the statistical tests that they employ 
(Mittag & Thompson, in press; Nelson, Rosenthal & Rosnow, 1986; 
Oakes, 1986; Rosenthal & Gaito, 1963; Zuckerman, Hodgins, Zuckerman 
& Rosenthal, 1993). 

Of course, even many defenders of statistical tests (cf. 
Abelson, 1997; Cortina & Dunlap, 1997; Frick, 1996; Robinson & 
Levin, 1997; also see Harlow, Mulaik & Steiger, 1997, and reviews 
by Levin, 1998 and Thompson, 1998) agree that the tests have 
sometimes been abused or misinterpreted. One area of agreement 
across many scholars writing on these topics is that researchers 
ought to report and interpret effect sizes (cf. Kirk, 1996;
Thompson, 1996). Snyder and Lawson (1993) explain what effect sizes
are and summarize the many available choices (e.g., Cohen's d, η²,
ω²).

In 1996, the APA Board of Scientific Affairs appointed its 
Task Force on Statistical Inference to make recommendations 
regarding whether statistical significance tests should be banned 
from APA journals (Azar, 1997; Shea, 1996). In its recently
published article, the Task Force emphasized, "Always provide some
effect-size estimate when reporting a p value" (Wilkinson & The APA 
Task Force on Statistical Inference, 1999, p. 599, emphasis added). 
Later the Task Force also wrote, 

Always present effect sizes for primary outcomes. . . .
It helps to add brief comments that place these
effect sizes in a practical and theoretical
context. . . . We must stress again that reporting and
interpreting effect sizes in the context of
previously reported effects is essential to good
research. (p. 599, emphasis added)

Of course, the 1994 APA publication manual, incorporated by 
reference into the editorial policies of hundreds if not thousands 
of behavioral science journals, did "encourage" (p. 18) effect size 
reporting. However, as summarized by Vacha-Haase, Nilsson, Reetz,
Lance and Thompson (in press), 11 empirical studies of 1 or 2
post-1994 volumes of 23 different journals confirm that this
"encouragement" has been ineffectual (cf. Keselman et al., 1998).

Thompson (1999b) explained why the APA "encouragement" has
been so ineffective. He noted that only "encouraging" effect size 
reporting 

presents a self-canceling mixed-message. To present 
an "encouragement" in the context of strict absolute 
standards regarding the esoterics of author note 
placement, pagination, and margins is to send the 
message, "these myriad requirements count, this 
encouragement doesn't." (p. 162) 

Consequently, various journals now require effect size 
reporting (e.g., Heldref Foundation, 1997, pp. 95-96; Murphy,
1997). Such journals include:

Educational and Psychological Measurement;
Journal of Agricultural Education;
Journal of Applied Psychology;
Journal of Consulting and Clinical Psychology;
Journal of Early Intervention;
Journal of Experimental Education;
Journal of Learning Disabilities;
Language Learning; and
The Professional Educator.

Editors at these journals will soon ask their editorial boards to 
approve such a requirement: 

Journal of Mental Health Counseling; and 
Research in the Schools. 

Web Instruction on Effect Size-related Concepts 

A fundamental concept in weighing effect sizes against
statistical significance is that

The calculated p values in a given study are a 
function of several study features, but are 
particularly influenced by the confounded, joint 
influence of study sample size and study effect 
sizes. Because p values are confounded indices, in 
theory 100 studies with varying sample sizes and 100 
different effect sizes could each have the same 
single p_calculated, and 100 studies with the same single
effect size could each have 100 different values for
p_calculated. (Thompson, 1999c, pp. 169-170)

There are various Web applications that could be employed to teach 
insights related to this concept. 

One vehicle for such instruction might sequentially present a 
series of different studies, each with a fixed roughly-identical 
single effect size, but different n's and consequently each with 
different p values. Table 12 in my 1999 AERA Invited Address 
(Thompson, 1999a) presents just such a series. Students might then 
be asked both (a) to interpret each study's individual results and 
(b) to interpret the set of studies as a holistic series, as an 
emerging cumulating literature might be interpreted. 
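
The kind of series described above can be sketched in a few lines. What follows is only an illustration and does not reproduce Table 12: the effect size (a standardized mean difference of 0.30), the sample sizes, and the function name `two_sided_p` are all hypothetical, and a large-sample z approximation is assumed in place of the t or F test a real study would use.

```python
import math


def two_sided_p(d, n_per_group):
    """Two-sided p value for a standardized mean difference d between two
    equal-sized groups, using a large-sample z approximation (an assumption
    of this sketch, chosen so that only the standard library is needed)."""
    z = d * math.sqrt(n_per_group / 2.0)
    return math.erfc(abs(z) / math.sqrt(2.0))


FIXED_D = 0.30                         # hypothetical effect size, held constant
SAMPLE_SIZES = [10, 20, 50, 100, 200]  # hypothetical n per group

# Same effect size in every "study"; only the sample size changes.
fixed_effect_series = [(n, two_sided_p(FIXED_D, n)) for n in SAMPLE_SIZES]

for n, p in fixed_effect_series:
    verdict = "p < .05" if p < 0.05 else "p >= .05"
    print(f"d = {FIXED_D:.2f}, n per group = {n:3d}, p = {p:.4f} ({verdict})")
```

Under these assumptions the identical effect moves from "not significant" to "significant" purely as a function of n, which is the pattern students would be asked to interpret both study by study and as a cumulating literature.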

Another alternative would be to present a series of studies in 
which effect sizes and sample sizes varied but that each yielded an 
essentially fixed p_calculated value. Table 13 from Thompson (1999a)
presents such a series. Again, students might be presented with the
same two interpretation challenges. These exercises would force
students to have the necessary "ah ha" experience related to the
influences of sample sizes on p values, problems with interpreting
p values without consulting effect sizes, and the importance of
effect sizes. 
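
The complementary series, in which p is held fixed while effect sizes and sample sizes vary, might be sketched as follows. Again this is an illustration under stated assumptions and not a reproduction of Table 13: the target p of .05, the sample sizes, and the helper names are hypothetical, and the same large-sample z approximation for a two-group standardized mean difference is assumed.

```python
import math
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Two-sided p for a standardized mean difference d (z approximation)."""
    z = d * math.sqrt(n_per_group / 2.0)
    return math.erfc(abs(z) / math.sqrt(2.0))


def d_needed_for_p(target_p, n_per_group):
    """Effect size d that yields exactly target_p at this sample size,
    under the same z approximation (an assumption of this sketch)."""
    z_crit = NormalDist().inv_cdf(1.0 - target_p / 2.0)
    return z_crit * math.sqrt(2.0 / n_per_group)


TARGET_P = 0.05  # hypothetical fixed p value shared by every "study"

fixed_p_series = []
for n in [10, 20, 50, 100, 200, 1000]:  # hypothetical n per group
    d = d_needed_for_p(TARGET_P, n)
    fixed_p_series.append((n, d, two_sided_p(d, n)))

for n, d, p in fixed_p_series:
    print(f"n per group = {n:4d}, d = {d:.3f}, p = {p:.3f}")
```

Every row reports the same p, yet the effect sizes differ by roughly an order of magnitude across the series, so the p value alone cannot tell a student how large any one effect is.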

A related series of vignette presentations might present both
"uncorrected" (e.g., η²) and "corrected" (e.g., ω²) effect
sizes. This particular series would help students to understand
what sampling error variance is and what three factors cause
sampling error variance (Thompson, 1999a).
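
A minimal sketch of such a comparison appears below. It assumes a one-way fixed-effects ANOVA and the standard textbook formulas for eta-squared and omega-squared; the data sets and the function name `anova_effect_sizes` are hypothetical. The second data set simply repeats the first five times, so that only sample size changes.

```python
def anova_effect_sizes(groups):
    """Return (eta_squared, omega_squared) for a one-way ANOVA layout.

    eta^2   = SS_between / SS_total
    omega^2 = (SS_between - df_between * MS_within) / (SS_total + MS_within)
    """
    all_scores = [x for g in groups for x in g]
    n_total = len(all_scores)
    grand_mean = sum(all_scores) / n_total

    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_within = ss_total - ss_between

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    ms_within = ss_within / df_within

    eta_sq = ss_between / ss_total
    omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
    return eta_sq, omega_sq


# Hypothetical scores for three groups of four people each.
small_study = [[4, 5, 6, 5], [6, 7, 8, 7], [5, 6, 7, 6]]
# Same score pattern with five times as many people per group.
large_study = [g * 5 for g in small_study]

eta_small, omega_small = anova_effect_sizes(small_study)
eta_large, omega_large = anova_effect_sizes(large_study)

print(f"n = 12: eta^2 = {eta_small:.3f}, omega^2 = {omega_small:.3f}")
print(f"n = 60: eta^2 = {eta_large:.3f}, omega^2 = {omega_large:.3f}")
```

With these made-up data the uncorrected estimate is identical in both studies, while the corrected estimate is shrunken more in the smaller study. Sample size is only one of the influences on sampling error that such a vignette series could expose.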

Principle #2: Evaluate, Report and Interpret Score Reliability
Background 

In addition to strongly emphasizing the importance of effect 
sizes, the APA Task Force on Statistical Inference also emphasized 
that 

It is important to remember that a test is not 
reliable or unreliable. Reliability is a property of 
the scores on a test for a particular population of 
examinees (Feldt & Brennan, 1989). Thus, authors
should provide reliability coefficients of the 
scores for the data being analyzed even when the 
focus of their research is not psychometric. 
Interpreting the size of observed effects requires 
an assessment of the reliability of the scores. 
(Wilkinson & The APA Task Force on Statistical 
Inference, 1999, p. 596) 

Thompson and Vacha-Haase (2000) present a thorough (i.e.,
protracted) elaboration of these issues. 

Unfortunately, empirical studies indicate that most authors do 
not evaluate and report the reliability coefficients for their own 
data (cf. Meier & Davis, 1990; Snyder & Thompson, 1998; Thompson & 
Snyder, 1998; Vacha-Haase, Ness, Nilsson & Reetz, 1999; Willson, 
1980). Nor do authors who merely cite reliability coefficients
from previous studies even explicitly compare (a) their own sample
compositions and (b) their own sample score variabilities with
those in the previous studies, to establish that the previous
coefficients might be generalized (Vacha-Haase & Kogan, in press)!

These dismal patterns of practice may occur because many 
researchers may not really understand what score reliability is 
(Thompson & Vacha-Haase, 2000). Certainly such misperceptions ought
to be expected, given the short shrift afforded measurement training
in doctoral programs throughout the United States (Aiken, West,
Sechrest, Reno with Roediger, Scarr, Kazdin, & Sherman, 1990).

Web Instruction on Score Reliability-related Concepts 

Reliability is not a property of a test per se, but rather
inures to a particular set of scores (Thompson & Vacha-Haase,
2000). Reliability is driven by score variability, and the
generalizability of score reliability is driven by the
comparability of the composition of samples with the sample
composition used in a referenced prior reliability (e.g.,
normative) study (Crocker & Algina, 1986, p. 144).

The importance of sample variability as regards score 
reliability might be taught by building an applet to generate pairs
of scores in ascending order for a fixed sample size, with a random
number generator adding or subtracting a small random
adjustment to each score in each pair. The applet might request as
input the desired SD of the scores. The score pairs modeling
test-retest reliability, for example, would then be generated, and the
resulting reliability coefficient would be reported.
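
The core logic of such an applet might look like the sketch below. It is a simulation under stated assumptions rather than a finished design: the sample size, the mean of 50, the error SD of 3, the seed, and all function names are hypothetical, and the "small random adjustment" is modeled as normally distributed error added independently to the test and retest scores.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation, used here as the test-retest coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ordered_scores(n, mean, sd):
    """n scores in ascending order, rescaled to the requested mean and SD."""
    base = list(range(n))
    base_mean = statistics.fmean(base)
    base_sd = statistics.pstdev(base)
    return [mean + sd * (b - base_mean) / base_sd for b in base]


def simulated_test_retest(n, sd, error_sd, seed):
    """Build score pairs from one ordered score set plus random error."""
    rng = random.Random(seed)
    scores = ordered_scores(n, 50.0, sd)
    test = [s + rng.gauss(0.0, error_sd) for s in scores]
    retest = [s + rng.gauss(0.0, error_sd) for s in scores]
    return pearson_r(test, retest)


# Hypothetical inputs: the student varies the SD; everything else is fixed.
reliability_by_sd = {
    sd: simulated_test_retest(n=200, sd=sd, error_sd=3.0, seed=1)
    for sd in (2.0, 5.0, 10.0, 15.0)
}

for sd, r in reliability_by_sd.items():
    print(f"requested SD = {sd:4.1f}, test-retest r = {r:.2f}")
```

Because the size of the random adjustment is held constant, the only thing a student changes is score variability, and the reported coefficient should rise as the requested SD grows.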

Students would then grasp at a deeper level why score 
variability impacts score reliability. In classical measurement 
theory reliability deals with the consistency with which 
individuals are rank ordered by measurement across parallel test 
forms, repeated measurements, and so forth. The degree of the 
homogeneity of the scores (i.e., SDx) directly affects the 
consistency (e.g., stability) of the score orderings because, as
Cunningham (1986) explained, 

[W]hen scores are bunched together, a small [random 
measurement error] change in raw score will lead to 
large changes in relative position. If scores are 
spread out (variability is high), it is more likely
that the relative position in the group will remain
stable across the two forms of the test and the
correlation coefficient will be relatively large.
(p. 114)

In other words, "greater differences between the scores of 
individuals reduce the possibility of shifting positions" (Linn & 
Gronlund, 1995, p. 101). 

The influence of sample composition on score reliability might
be taught by generating a population scattergram for a test-retest 
situation in which score reliability differed across males and 
females. A Web applet might then allow sampling of different 
numbers of males and females, and report resulting reliability 
coefficients for each sample. Students would see score reliability 
coefficients fluctuate across every variation in sample 
composition. 
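
A sketch of the sampling logic behind such an applet follows. Everything numeric here is hypothetical, including which group is given the noisier scores: both groups share a score mean of 50 and SD of 10, males are assigned a measurement error SD of 8, and females an error SD of 3. For brevity the sketch draws cases directly from that generating model rather than from a stored finite population.

```python
import math
import random
import statistics

# Hypothetical error SDs; which group is noisier is arbitrary in this sketch.
ERROR_SD = {"male": 8.0, "female": 3.0}
SCORE_MEAN, SCORE_SD = 50.0, 10.0


def pearson_r(x, y):
    """Pearson correlation, used here as the test-retest coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_reliability(n_male, n_female, seed=1):
    """Test-retest r for a sample with the requested composition."""
    rng = random.Random(seed)
    test, retest = [], []
    for group, n in (("male", n_male), ("female", n_female)):
        for _ in range(n):
            stable_part = rng.gauss(SCORE_MEAN, SCORE_SD)
            test.append(stable_part + rng.gauss(0.0, ERROR_SD[group]))
            retest.append(stable_part + rng.gauss(0.0, ERROR_SD[group]))
    return pearson_r(test, retest)


# Hypothetical sample compositions a student might request.
compositions = [(2000, 0), (1500, 500), (1000, 1000), (500, 1500), (0, 2000)]
results = {c: sample_reliability(*c) for c in compositions}

for (n_male, n_female), r in results.items():
    print(f"males = {n_male:4d}, females = {n_female:4d}, r = {r:.2f}")
```

Large samples are used so that the differences reflect composition rather than sampling fluctuation; an applet could also let students shrink n and watch the two sources of variation combine.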

Summary 

Good statistical instruction is instruction that teaches 
students to understand dynamics within statistics as different 
characterizations of data. Statistics mastery does not equate with 
the rote memorization of formulae. Rather, the essence of 
conducting research is the exercise of reflective judgment. As 
Huberty and Morris (1988, p. 573) noted: "As in all of statistical 
inference, subjective judgment cannot be avoided. Neither can 
reasonableness!"

Web-based statistical instruction, like all statistical 
instruction, ought to focus on teaching the essence of the research 
endeavor: the exercise of reflective judgment. Using the framework 
of the recent report of the APA Task Force on Statistical Inference 
(Wilkinson & The APA Task Force on Statistical Inference, 1999),
the present paper explored background for and potential 
instructional design of Web-based instruction involving (a) effect 
size reporting and interpretation and (b) score reliability 
evaluation. 